Building an efficient OCR system for historical documents with little training data
Similar resources
Efficient OCR Training Data Generation with Aletheia
We present how the ground-truthing tool Aletheia can be used to efficiently create training data for an open-source text recognition engine. The labelling process is sped up considerably through a top-down approach. Text content is thereby entered on region level. The characters are then propagated automatically to glyph objects. In addition, segmentation is simplified by several semi-automated...
An OCR System for Printed Documents
This paper describes the general structure of a fully automated document analysis system for printed documents. The system is based on a character preclassification stage, which reduces the number of patterns to recognize and introduces a new form of contextual processing. This approach to reading multifont printed documents is based on character pattern redundancies. With the study of prototyp...
OCR binarization and image pre-processing for searching historical documents
We consider the problem of document binarization as a pre-processing step for optical character recognition (OCR) for the purpose of keyword search of historical printed documents. A number of promising techniques from the literature for binarization, pre-filtering, and post-binarization denoising were implemented along with newly developed methods for binarization: an error diffusion binarizat...
Automatic Assessment of OCR Quality in Historical Documents
Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools. Issues include noisy backgrounds and faded text due to aging, border/marginal noise, bleed-through, skewing, warping, as well as irregular fonts and page layouts. As a result, OCR tools often produce a large number of spurious bounding boxes (BBs) in addition to those that correspon...
A Hybrid Framework for Building an Efficient Incremental Intrusion Detection System
In this paper, a boosting-based incremental hybrid intrusion detection system is introduced. This system combines incremental misuse detection and incremental anomaly detection. We use a boosting ensemble of weak classifiers to implement the misuse intrusion detection system. It can identify new types of intrusions that do not exist in the training dataset for incremental misuse detection. As...
Journal
Journal title: Neural Computing and Applications
Year: 2020
ISSN: 0941-0643,1433-3058
DOI: 10.1007/s00521-020-04910-x